A. The Routing Problem

When 50 tools is worse than 5

Agenda

  • A. The Routing Problem — Why tool selection matters (~10 min)
  • B. Classifier-Based Routing — LLM categorizes queries (~15 min)
  • C. Semantic Routing — Embeddings for tool matching (~20 min)
  • D. Programmatic Tool Calling — Batch tools without round-trips (~15 min)
  • E. Wrap-up — Key takeaways & lab preview (~5 min)

The Tool Proliferation Problem

Your agent starts with 3 tools. Then it grows:

  • search_web, search_news, search_academic
  • get_stock_price, get_financial_report, calculate_roi
  • read_pdf, summarize_document, extract_tables
  • send_email, create_calendar_event, query_database

Result: 20+ tools crammed into every context window.

Why More Tools ≠ Better Agent

Context Window Pollution

  • Every tool schema consumes tokens
  • 20 tools = ~2000 tokens per request
  • Less space for reasoning and memory
  • Higher latency, higher cost

Decision Confusion

  • Model sees too many options
  • May pick wrong tool
  • May call multiple similar tools
  • Inconsistent behavior

The Insight

The agent doesn’t need ALL tools for EVERY query. It only needs the RELEVANT ones.

The Solution: Dynamic Tool Routing

graph LR
    Q["User Query"] --> R["Router"]
    R -->|"Financial"| TF["Finance Tools<br/>(3 tools)"]
    R -->|"Academic"| TA["Academic Tools<br/>(4 tools)"]
    R -->|"General"| TG["General Tools<br/>(5 tools)"]
    TF --> A["Agent"]
    TA --> A
    TG --> A

    style R fill:#1C355E,stroke:#00C9A7,color:white
    style Q fill:#00C9A7,stroke:#1C355E,color:#1C355E
    style A fill:#FF7A5C,stroke:#1C355E,color:#1C355E

The router selects a subset of tools before the agent runs.

B. Classifier-Based Routing

Using a small LLM to categorize queries

How Classifier Routing Works

sequenceDiagram
    participant U as User
    participant C as Classifier LLM
    participant T as Tool Registry
    participant A as Agent

    U->>C: "What's Apple's stock price?"
    C->>C: Classify domain
    C-->>C: Domain = "Financial"
    C->>T: Get tools for "Financial"
    T-->>C: [get_stock_price, calculate_roi]
    C->>A: Agent + 2 tools
    A->>U: "$178.72"

The Classifier Prompt

A small, fast model (GPT-4o-mini, Claude Haiku) classifies the query:

CLASSIFIER_PROMPT = """Classify this query into ONE domain.

Domains: financial, academic, general, technical

Query: {query}

Respond with ONLY the domain name, nothing else.
"""

Trade-off: Fast and cheap (~$0.0001 per query), but requires predefined categories.

Classifier Router Implementation

from litellm import completion  # assumes a LiteLLM-style completion() client

class ClassifierRouter:
    def __init__(self, domain_tool_map: dict[str, list[str]]):
        self.domain_tool_map = domain_tool_map

    def classify(self, query: str) -> str:
        response = completion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        )
        return response.choices[0].message.content.strip().lower()

    def select_tools(self, query: str) -> list[Tool]:
        domain = self.classify(query)
        # Unknown domains fall back to an empty list; consider defaulting to "general".
        # `registry` and `Tool` come from your tool registry module.
        tool_names = self.domain_tool_map.get(domain, [])
        return [registry.get_tool(name) for name in tool_names]
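To see the routing behavior without an API call, here is a minimal runnable sketch that swaps the classifier LLM for a keyword stub. The domain map and tool names are illustrative, not from a real registry:

```python
# Minimal sketch: ClassifierRouter behavior with the LLM call stubbed out.
DOMAIN_TOOL_MAP = {
    "financial": ["get_stock_price", "calculate_roi"],
    "academic": ["search_academic", "read_pdf"],
    "general": ["search_web", "send_email"],
}

def classify_stub(query: str) -> str:
    # Stand-in for the classifier LLM: a keyword heuristic, for demo only
    q = query.lower()
    if "stock" in q or "roi" in q:
        return "financial"
    if "paper" in q or "cite" in q:
        return "academic"
    return "general"

def select_tools(query: str) -> list[str]:
    domain = classify_stub(query)
    return DOMAIN_TOOL_MAP.get(domain, DOMAIN_TOOL_MAP["general"])

print(select_tools("What's Apple's stock price?"))  # ['get_stock_price', 'calculate_roi']
```

In the real router the stub is replaced by the LLM call, but the mapping logic stays identical.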

When Classifier Routing Works Well

| Good For | Not Good For |
|---|---|
| Clear domain boundaries | Overlapping domains |
| Fixed tool categories | Ad-hoc tool additions |
| Low-latency requirements | Nuanced query understanding |
| Budget constraints | Cross-domain queries |

Rule of thumb: If you can list your domains on one hand, classifier routing is a good starting point.

C. Semantic Routing

Embeddings for intelligent tool matching

From Categories to Similarity

Problem with classifiers: You must define categories upfront.

Semantic routing: Match the query to tools based on meaning, not labels.

graph LR
    Q["Query Embedding"] --> S1["Tool 1<br/>similarity: 0.92"]
    Q --> S2["Tool 2<br/>similarity: 0.87"]
    Q --> S3["Tool 3<br/>similarity: 0.45"]
    Q --> S4["Tool 4<br/>similarity: 0.23"]
    
    S1 --> T["Top-K Selected"]
    S2 --> T

    style Q fill:#1C355E,stroke:#00C9A7,color:white
    style S1 fill:#00C9A7,stroke:#1C355E,color:#1C355E
    style S2 fill:#00C9A7,stroke:#1C355E,color:#1C355E
    style T fill:#FF7A5C,stroke:#1C355E,color:#1C355E

How Semantic Routing Works

  1. Index phase: Embed all tool descriptions once at startup
  2. Query phase: Embed the user’s query
  3. Match phase: Compute cosine similarity, select top-K tools
# Index phase (once)
tool_embeddings = {
    "search": embed("search: Find information on the web"),
    "calculate": embed("calculate: Evaluate mathematical expressions"),
    "get_stock": embed("get_stock_price: Get current stock price"),
}

# Query phase (per request)
query_embedding = embed("What is Apple's stock worth?")
similarities = cosine_similarity(query_embedding, tool_embeddings)
top_tools = select_top_k(similarities, k=3)

The SemanticToolSelector Class

class SemanticToolSelector:
    def __init__(self, embedding_model: str):
        self.embedding_model = embedding_model
        self._tool_embeddings: dict[str, list[float]] = {}

    def build_index(self):
        # Embed each tool's "name: description" text once at startup
        ...

    def select_tools(self, query: str, top_k: int = 5) -> list[Tool]:
        # 1. Embed the query
        # 2. Score each tool by cosine_similarity(query_vec, tool_vec)
        # 3. Return top-K by score
        ...

Cosine Similarity Explained

Measures the angle between two vectors:

\[\text{similarity} = \frac{A \cdot B}{\|A\| \|B\|}\]

| Similarity | Interpretation |
|---|---|
| 1.0 | Identical meaning |
| 0.8+ | Highly related |
| 0.5-0.8 | Somewhat related |
| < 0.5 | Unrelated |

Threshold choice: For tool selection, typically use top-K with K=3-5 rather than a fixed threshold.
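The formula above fits in a few lines of plain Python (no numpy needed at this scale). The tool vectors here are toy 3-dimensional stand-ins for real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(A, B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def select_top_k(query_vec: list[float], tool_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    # Score every tool against the query, return the k best names
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in tool_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy 3-d vectors standing in for real embeddings
tools = {
    "get_stock": [0.9, 0.1, 0.0],
    "search":    [0.2, 0.8, 0.1],
    "calculate": [0.1, 0.1, 0.9],
}
print(select_top_k([0.85, 0.2, 0.05], tools, k=2))  # ['get_stock', 'search']
```

Real embeddings have hundreds to thousands of dimensions, but the scoring logic is the same.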

The RoutedAgent Pattern

Combine routing with the agent loop for efficient execution:

class RoutedAgent:
    def run(self, query: str) -> str:
        selected_tools = self.router.select_tools(query)   # Route first
        return self.base_agent.run(query, tools=selected_tools)  # Then act
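A runnable sketch of the same pattern, with the router and agent replaced by hypothetical stubs so the wiring itself can be exercised:

```python
class StubRouter:
    # Stand-in for a real classifier/semantic router
    def select_tools(self, query: str) -> list[str]:
        return ["get_stock_price"] if "stock" in query.lower() else ["search_web"]

class StubAgent:
    # Stand-in agent: just reports which tools it was given
    def run(self, query: str, tools: list[str]) -> str:
        return f"answered {query!r} using {tools}"

class RoutedAgent:
    def __init__(self, router, base_agent):
        self.router = router
        self.base_agent = base_agent

    def run(self, query: str) -> str:
        selected = self.router.select_tools(query)          # Route first
        return self.base_agent.run(query, tools=selected)   # Then act

agent = RoutedAgent(StubRouter(), StubAgent())
print(agent.run("What's Apple's stock price?"))
```

Only the tools the router selects ever reach the inner agent, which is the whole point of the wrapper.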

Routed vs Standard Agent

graph LR
    subgraph Standard["Standard Agent"]
        Q1["Query"] --> A1["Agent<br/>(20 tools in context)"]
        A1 --> R1["Result"]
    end
    subgraph Routed["Routed Agent"]
        Q2["Query"] --> RT["Router"]
        RT --> A2["Agent<br/>(3 tools in context)"]
        A2 --> R2["Result"]
    end

Context Window Savings

| Scenario | Tools in Context | Tool Schema Tokens |
|---|---|---|
| Standard Agent | 20 tools | ~2000 tokens |
| Routed Agent (top-5) | 5 tools | ~500 tokens |
| Savings | -15 tools | ~1500 tokens |

Impact:

  • 75% reduction in tool schema overhead
  • More room for memory and reasoning
  • Lower latency and cost per query
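The 75% figure follows directly from the table, assuming a rough ~100 tokens per tool schema:

```python
TOKENS_PER_SCHEMA = 100  # rough estimate; real schemas vary with description length

standard = 20 * TOKENS_PER_SCHEMA   # 2000 tokens of tool schemas
routed = 5 * TOKENS_PER_SCHEMA      # 500 tokens after top-5 routing
savings = (standard - routed) / standard
print(f"{savings:.0%} reduction")   # 75% reduction
```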

Classifier vs Semantic Routing

| Aspect | Classifier | Semantic |
|---|---|---|
| Setup | Define domains + mapping | Just index tools |
| Flexibility | Rigid categories | Adapts to any query |
| Speed | 1 LLM call (~200ms) | 1 embedding call (~50ms) |
| Cost | ~$0.0001 per query | ~$0.00001 per query |
| Best for | Clear domains | Dynamic tool sets |

Recommendation

Start with classifier routing for simplicity. Upgrade to semantic routing when your tool set grows beyond 15-20 tools or spans multiple domains.

When to Use Routed Agents

| Use Case | Recommended Approach |
|---|---|
| 5-10 tools | Standard agent (no routing needed) |
| 10-20 tools, clear domains | Classifier routing |
| 20+ tools, mixed domains | Semantic routing |
| Dynamic tool registration | Semantic routing |

Don’t Over-Engineer

If your agent has 5 tools and works fine, adding routing is premature optimization. Only add routing when tool count becomes a bottleneck.

D. Programmatic Tool Calling

Eliminate round-trips for multi-tool workflows

The Round-Trip Problem

Traditional tool calling: N tools = N model round-trips

sequenceDiagram
    participant M as Model
    participant T as Tools
    
    M->>T: Call tool_1
    T-->>M: Result (in context)
    M->>T: Call tool_2
    Note over M,T: ...
    M->>T: Call tool_n
    T-->>M: Result (in context)
    M->>M: Final answer

Note

Each round-trip: latency + token overhead + cost

How Programmatic Tool Calling Works

Claude writes code that calls your tools, keeping the model out of the per-call loop:

sequenceDiagram
    participant M as Model
    participant C as Code Execution
    participant T as Tools
    
    M->>C: Write code to call tools
    C->>T: tool_1()
    T-->>C: Result
    Note over C,T: ...
    C->>T: tool_n()
    T-->>C: Result
    C->>C: Process/filter results
    C-->>M: Final output only
    M->>M: Response

Tip

Key insight: Tool results stay in code execution — only final output reaches context

The allowed_callers Pattern

Mark tools as callable from code execution:

tools = [
    {"type": "code_execution_20260120", "name": "code_execution"},
    {
        "name": "query_database",
        "description": "Execute a SQL query. Returns JSON.",
        "input_schema": {"type": "object", "properties": {"sql": {"type": "string"}}},
        "allowed_callers": ["code_execution_20260120"],  # <-- Key field
    },
]

| allowed_callers | Meaning |
|---|---|
| ["direct"] | Only the model can call (default) |
| ["code_execution_20260120"] | Only callable from code execution |
| Both values | Callable either way |

Example: Batch Processing

Query sales for 5 regions, return only the top performer:

# Claude writes this code internally
regions = ["West", "East", "Central", "North", "South"]
results = {}

for region in regions:
    data = await query_database(f"SELECT SUM(revenue) FROM sales WHERE region='{region}'")
    results[region] = data[0]["sum"]

top_region = max(results.items(), key=lambda x: x[1])
print(f"Top region: {top_region[0]} with ${top_region[1]:,}")

What reaches context: "Top region: West with $2,340,000" — not all 5 query results
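The same control flow can be simulated synchronously. Here query_database is stubbed with canned revenue figures; in the real pattern, only the final printed summary would reach the model's context:

```python
# Stubbed tool: stands in for the real query_database call
FAKE_REVENUE = {"West": 2_340_000, "East": 1_900_000, "Central": 1_200_000,
                "North": 800_000, "South": 1_500_000}

def query_database(sql: str) -> list[dict]:
    # Parse the region out of the illustrative SQL string and return canned data
    region = sql.split("region='")[1].rstrip("'")
    return [{"sum": FAKE_REVENUE[region]}]

results = {}
for region in ["West", "East", "Central", "North", "South"]:
    data = query_database(f"SELECT SUM(revenue) FROM sales WHERE region='{region}'")
    results[region] = data[0]["sum"]

top_region = max(results.items(), key=lambda x: x[1])
summary = f"Top region: {top_region[0]} with ${top_region[1]:,}"
print(summary)  # Top region: West with $2,340,000
```

Five tool results existed transiently in the execution environment; one short string is all the model ever sees.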

Token & Latency Gains

| Metric | Traditional (5 tools) | Programmatic |
|---|---|---|
| Model turns | 5 round-trips | 1 turn |
| Tool results in context | All 5 | None (filtered) |
| Latency | ~15 seconds | ~5 seconds |
| Tokens consumed | ~10,000 | ~2,000 |

The Multiplier Effect

Calling 10 tools directly uses ~10x the tokens of calling them programmatically and returning a summary.

Advanced Patterns

Early termination — stop as soon as success criteria met:

for endpoint in ["us-east", "eu-west", "apac"]:
    status = await check_health(endpoint)
    if status == "healthy":
        print(f"Found healthy: {endpoint}")
        break

Conditional tool selection — choose tool based on data:

file_info = await get_file_info(path)
if file_info["size"] < 10000:
    content = await read_full_file(path)
else:
    content = await read_file_summary(path)
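The early-termination pattern is easy to verify with a stubbed check_health that records how many calls are made before the loop stops:

```python
calls = []

def check_health(endpoint: str) -> str:
    # Stubbed health check; records each call so we can count round-trips saved
    calls.append(endpoint)
    return "healthy" if endpoint == "eu-west" else "degraded"

found = None
for endpoint in ["us-east", "eu-west", "apac"]:
    if check_health(endpoint) == "healthy":
        found = endpoint
        break  # early termination: "apac" is never checked

print(found, len(calls))  # eu-west 2
```

With traditional tool calling, all three checks would cost a model round-trip each; here the third call never happens at all.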

When to Use Programmatic Calling

| Good For | Less Ideal |
|---|---|
| 3+ dependent tool calls | Single tool call |
| Large result filtering | Simple responses |
| Loops over many items | Need user feedback each step |
| Batch data processing | Very fast single operations |

Vendor Note

This pattern is currently Claude-specific (requires their code execution container). The concept generalizes — you can implement client-side with your own sandbox.

Alternative: Self-Managed Execution

Not using Claude? You can implement the pattern yourself:

  1. Give Claude a code execution tool (e.g., Python sandbox)
  2. Describe available functions in that environment
  3. Claude writes code → you execute → return result
# Your code execution tool
{
    "name": "execute_python",
    "description": "Run Python code. Available functions: search(), query_db(), send_email()",
    ...
}

Trade-off: More control, but you manage security and infrastructure.
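A bare-bones sketch of the execute side: model-written code runs with only whitelisted functions in scope. The function names are illustrative, and plain exec() is NOT a security boundary; a real deployment needs an actual sandbox (container, subprocess with limits, etc.):

```python
import io
import contextlib

def search(q: str) -> str:
    return f"results for {q}"  # stub tool

def query_db(sql: str) -> list:
    return [("row1",)]  # stub tool

# Names the model-written code is allowed to use
ALLOWED = {"search": search, "query_db": query_db, "print": print, "len": len}

def execute_python(code: str) -> str:
    # Run model-written code with only whitelisted names in scope and
    # capture stdout as the result returned to the model.
    # NOTE: exec() alone is not safe; use a real sandbox in production.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {"__builtins__": {}}, dict(ALLOWED))
    return buf.getvalue().strip()

output = execute_python('print(search("agents"))')
print(output)  # results for agents
```

The model sees only the captured stdout, mirroring the "final output only" behavior of the managed container.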

E. Wrap-up

Key Takeaways

  1. Tool proliferation hurts performance — route to reduce context bloat
  2. Classifier routing uses a small LLM to categorize queries (fast, rigid)
  3. Semantic routing uses embeddings to match queries to tools (flexible, scalable)
  4. Programmatic calling eliminates round-trips — Claude writes code that calls tools
  5. Combine patterns: route to select tools, programmatic calling to execute efficiently

Lab Preview: Efficient Tool Calling

Part 1: Classifier Router

  • Implement ClassifierRouter
  • Map domains to tool sets

Part 2: Semantic Router

  • Build SemanticToolSelector
  • Index tool descriptions

Part 3: Routed Agents

  • Create RoutedAgent wrapper
  • Compare context usage

Part 4: Programmatic Calling

  • Explore allowed_callers pattern
  • Measure token savings

Time: 75 minutes

Questions?

Session 4 Complete